Google Cloud Datalab allows you to use SQL to interact with Google BigQuery. However, SQL is just the starting point. Datalab enables you to combine the power of declarative SQL with imperative code (Python) to perform interesting data analysis, visualization, and transformation tasks.
You can use a number of Python data analysis, data wrangling, and visualization libraries, such as numpy, pandas, matplotlib, and many others. Several of these libraries build on top of a DataFrame object.
This notebook shows how to easily integrate these capabilities in a single notebook. This functionality is provided by the BigQuery APIs, so the first step is, of course, to import the module along with pandas.
In [4]:
import google.datalab.bigquery as bq
import pandas as pd
In the BigQuery APIs notebook, we've covered how to define a SQL query, execute it, and get a reference to the results in the form of a DataFrame. Let's start with a query to see what more we can do.
In [1]:
%%bq query -n requests
SELECT timestamp, latency, endpoint
FROM `cloud-datalab-samples.httplogs.logs_20140615`
WHERE endpoint = 'Popular' OR endpoint = 'Recent'
In [2]:
%%bq sample --count 5 --query requests
Out[2]:
In [5]:
df = requests.execute(output_options=bq.QueryOutput.dataframe()).result()
len(df)
Out[5]:
The resulting DataFrame can be further transformed, sliced, projected, and inspected using the extensive set of APIs available on the class. Let's print out the first five rows.
In [6]:
df.head(5)
Out[6]:
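For instance, projection and predicate-based slicing look like this (a minimal sketch; the column choice and the latency threshold of 100 are arbitrary illustrations, not values from the dataset):
In [ ]:
df[['endpoint', 'latency']].head(3)  # projection: keep just two columns
df[df.latency > 100].head(3)         # slicing: rows whose latency exceeds an arbitrary threshold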
Or it can be inspected for schema,
In [7]:
df.dtypes
Out[7]:
or further transformed locally, for example to perform grouping,
In [8]:
groups = df.groupby('endpoint')
groups.dtypes
Out[8]:
In [9]:
for name, df_group in groups:
  print('%s - %d items' % (name, len(df_group)))
  print(df_group.head(3))
  print()
and then analyze a dimension per group,
In [10]:
groups['latency'].describe()
Out[10]:
or even run a set of custom aggregation functions.
In [11]:
def unique(x):
  return sorted(set(list(x)))

groups['latency'].agg({ 'list': lambda x: list(x), 'unique': lambda x: unique(x) })
Out[11]:
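Pandas' built-in aggregations compose in the same way; as a quick sketch, several summary statistics can be computed at once:
In [ ]:
# min, mean, and max of latency, computed per endpoint group
groups['latency'].agg(['min', 'mean', 'max'])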
Conversion to TimeSeries Data
Our DataFrame instance contains timestamps, latencies, and endpoints. Let's reshape this DataFrame so that each endpoint is a column, that is, an independent series, with the timestamps serving as the index. For timestamps at which one of the endpoints has no data point, we'll fill in the next available value for that series (which is what the backfill below does).
In [16]:
df_series = df.pivot(index='timestamp', columns='endpoint', values='latency').fillna(method='backfill')
df_series[10:20]
Out[16]:
In [13]:
len(df_series)
Out[13]:
In [20]:
df_series.plot(logy=True)
Out[20]:
Resampling
The plot above is not meaningful. Our time-series data was captured at arbitrary timestamps, whenever a request happened to be processed. It is irregularly spaced, and there are a large number of data points. DataFrames provide the ability to resample a time series into more meaningful time windows.
In [21]:
df_series.resample(rule='10min').mean().plot(logy=True)
Out[21]:
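Other aggregations over the resampled windows work the same way; for example, a sketch plotting the worst-case latency per 10-minute window:
In [ ]:
# max() replaces mean() to show the peak latency within each window
df_series.resample(rule='10min').max().plot(logy=True)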
Python pandas provides an extensive toolbox of capabilities for working with data. Datalab combines the power of BigQuery and DataFrames.
Subsequent notebooks cover additional SQL capabilities that let you harness the full power of BigQuery, allowing you to run queries over your entire dataset, close to the data, before bringing a useful subset into the notebook.
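As a preview of that approach, here is a minimal sketch (against the same sample table used above) that computes the per-endpoint aggregation on the server side, so only the summarized results ever reach the notebook:
In [ ]:
%%bq query
SELECT endpoint, COUNT(*) AS requests, AVG(latency) AS avg_latency
FROM `cloud-datalab-samples.httplogs.logs_20140615`
WHERE endpoint IN ('Popular', 'Recent')
GROUP BY endpoint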